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Abstract 



We propose a novel approach for density estimation with exponential families for the case when the true 
density may not fall within the chosen family. Our approach augments the sufficient statistics with features 
designed to accumulate probability mass in the neighborhood of the observed points, resulting in a non- 
parametric model similar to kernel density estimators. We show that under mild conditions, the resulting 
model uses only the sufficient statistics if the density is within the chosen exponential family, and asymptot- 
ically, it approximates densities outside of the chosen exponential family. Using the proposed approach, we 
modify the exponential random graph model, commonly used for modeling small-size graph distributions, 
to address the well-known issue of model degeneracy. 



1 Introduction 



The problem of density estimation is ubiquitous in machine learning and statistics. A typical approach 
would assume a parametric family for the distribution from which the observed data is drawn and estimate 
the parameters by fitting them to the data. Among the parametric families, exponential families play a promi- 
nent role, as maximum likelihood estimation from complete data for exponential families is asymptotically 
unbiased, consistent, and efficient (van der Vaart, 1998, Chapter 5). Finding the maximum likelihood esti- 
mator (MLE) reduces to a convex optimization problem that requires knowing only the sufficient statistics 
from the data (Barndorff-Nielsen, 1978, Brown, 1986). When the functional form is not readily available, 
non-parametric approaches (e.g., kernel density estimation) provide a convenient alternative by allowing 
the number of parameters to grow with the available data. However, if the number of components in the 
data vectors is large relative to the number of the available data points, non-parametric approaches may 
suffer from the curse of dimensionality (Bellman, 1957) and overfit. From the standpoint of a bias-variance 
tradeoff, a parametric approach can be useful even if the number of observations is not large (as statistics 
of the data can often be estimated from relatively few samples) but could also be hindered by a misspeci- 
fication bias if the true distribution falls outside the chosen parametric family. Conversely, non-parametric 
approaches can approximate any density with enough data, suffering a high variance when the data is limited 
or the dimensionality is even moderately large. 

We propose a novel non-parametric density-estimation approach for exponential families that combines 
some of the strengths of parametric and non-parametric approaches. Our approach draws inspiration from 
kernel density estimators (KDEs), which approximate unknown densities by placing probability mass around 
the observations, and from the exponential families by imposing global constraints in matching the statistics. 
Exponential families are derived by maximizing the Shannon's entropy of the estimated distribution subject 
to the constraint that the expected values of the chosen statistics (features) with respect to the empirical 
and the estimated distributions must match. Our proposed exponential family model imposes additional 
constraints requiring a small constant probability mass around each example point. We accomplish this by 
augmenting the set of given statistics (features) with kernel functions centered around the observations, so 
that the expected value for each of these functions represents the probability mass concentrated around an 
example point. The resulting exponential family model is non-parametric, as each data point has a parameter 
associated with it. The objective function for the parameter estimation is convex and contains an l\ -penalty 
term for each added parameter. These penalties encourage sparsity by potentially making many of the added 
parameters vanish. We show that if the true distribution is within the exponential family model with the 
chosen statistics, then as the sample size increases, all parameters associated with the added local features 
vanish and our approach converges to the true distribution. If the true distribution is not from the chosen 
exponential family, then, our approach provides a close approximation to the unknown density, comparable 
to KDEs. 

Our work is in part motivated by a problem of learning distributions over graphs from examples of 
observed networks, typically from a single network. Such data arises in many domains, including social 
sciences, bioinformatics, and systems sciences. Among the approaches to this problem, one of the perhaps 
most-studied is the exponential random graph model (ERGM, or in the social network literature, p*, e.g., 
Frank and Strauss, 1986, Holland and Leinhardt, 1981, Wasserman and Pattison, 1996). ERGMs use graph 
statistics as features to define an exponential family distribution over all possible graphs with a given number 
of nodes. Such models have the desirable property of learning a distribution that matches the observed graph 
statistics. However, ERGMs often suffer from issues of degeneracy (Handcock, 2003, Lunga and Kirshner, 
201 1, Rinaldo et al., 2009) manifested in placing most of the probability mass on unrealistic graphs (e.g., an 
empty or a complete graph), very dissimilar to the observed graph(s). As an illustration of our approach, we 
propose a modification to ERGMs which alleviates the above issue of degeneracy in moderate-sized graphs. 
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The main contributions of this paper are a novel framework for non-parametric estimation of densities 
with exponential family models that is applicable when the number of data points is relatively small, analysis 
of of its convergence properties, and a modification of ERGMs that remedies one of the degeneracy issues. 
The paper is structured as follows. In Section 2, we briefly describe the exponential family models. In 
Section 3 we introduce the features we use to constrain the probability mass around the data points and 
derive a formulation for a new model from first principles. In Section 3.3, we derive some of the new model's 
properties, and then discuss the resulting parameter-estimation optimization problem and our approach to 
solving it in Section 3.4. In Section 4, we propose a new model for distributions over networks with a 
moderate number of nodes. We explore the properties of our estimator for 1 -dimensional densities and for 
modeling network data via an empirical study in Section 5, and finally discuss our findings and outline 
possible future directions in Section 6. 



2 Exponential Family 

We briefly introduce the exponential family of distributions before describing our contribution, a non- 
parametric exponential family. 

Suppose X is a vector of random variables with support X C R m . A distribution for X belongs to the 
exponential family of distributions with sufficient statistics t : X — > % C M d , if its probability density has 
a functional form: 1 

f E ( X \ X ) = ^7TT<? ( x ) ex P ( A > * ( x )) where 

ZW (1) 



Z (X) = / q (x) exp (A, t (x)) dx < oo 
Jx 



is a partition function, A is a vector of canonical parameters, q : X — > R is a base measure, and (•, •) denotes 
the Euclidean inner product. We further assume that the exponential family is regular (i.e. the canonical 
parameter space is open). Assuming q is fixed, let ETt denote the set of all possible distributions of the 
form (1) with the set of sufficient statistics t. 

Given samples x 1:n = (x 1 , . . . , x n ) / where / : X — > R is an unknown density with the same 
support as q. Let f n : X — > E be the empirical distribution for x lm , f n (x\x 1:n ) = ^ Y17=l $ ( a; *) wnere 
5 (x) is a Dirac delta function. Exponential families can be obtained as a solution to the optimization 
problem of minimizing the relative entropy subject to matching the moment constraints of the empirical and 
the estimated distributions: 

/f (a;) = arg min KL (f E || q) subj to (2) 

EfE {x) [t (*)] = ^^n) [t (X)} . (3) 

A distribution f E ^a:|A n ^ G ETt satisfying (3) can be found by maximizing the log-likelihood I (A) = 
(A, i * ( a; *)) — 1°§ % W' an d provided t* = Ef,, 1:n -. [t (x)] G rint (conv {%)), a maximum 
likelihood estimate (MLE) A n satisfying (3) exists (Wainwright and Jordan, 2008), and will be unique if 
ET t is minimal (Brown, 1986). If / G £T t , then A n 4 A (van der Vaart, 1998). 

However, if the true distribution does not fall within the chosen exponential family, / G" ETt, the 
estimated model may provide a poor approximation to the true density. As will be illustrated in Section 4, 
for the case of discrete random vectors X from the exponential family with a bounded support %, finding 



For notational convenience, we denote X — x by x. 
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Estimation of t distribution, df=6 




0- -4 -3-2-10 1 2 3 4 

X 

Figure 1: Density estimation from samples from a i-distribution. Black x's are samples; the black solid line 
is the true density, the blue dashed line is the fitted Gaussian density, and the red dashed line is the fitted 
non-parametric Gaussian density with non-parametric exponential family model with Gaussian kernel with 
width 1.5. 

the MLE under the wrong modeling assumption / G S Tt may assign very little probability mass to the 
observed samples x 1:n . 

3 Non-parametric Exponential Family 

In this section, we propose a new family of distributions, a modification to the exponential family STt- 
Our proposed approach modifies the set of features so that the estimated density (or a probability mass 
function for discrete vectors) places approximately the same amount of mass around each sample x % , i = 
1, . . . , n as the empirical distribution. This approach allows using exponential family models to approximate 
distributions outside of the exponential family (e.g., mixtures, heavy-tailed distributions). This approach can 
also be used to avoid degeneracy in cases where the set of features is poorly chosen (e.g., modeling of graphs 
with ERGMs). 

3.1 Motivation 

Suppose a set of samples from an unknown density "looks" Gaussian except perhaps for a few outliers in the 
tails (Figure 1). Should we fit a Gaussian? If not, should we use a non-parametric approach? Our approach 
combines both by using the exponential family with given features (e.g. t (x) = (x, x 2 ) in the case of a 
univariate Gaussian) as a starting point and then adding features for each data point. It draws inspiration 
from KDEs (also known as Parzen windows, Parzen, 1962), 

1 n 

/™ E ( X \x 1:n ) =-J2 R H ( X > ^ Whele 

n i=l 

K H (x; x*) = |H|~s K (H~5 [x - a;*)) . 



3 



K is a univariate kernel function, a bounded probability density function on M.. is a multivariate 
kernel function with a symmetric positive definite bandwidth matrix H; in this paper, we assume H = /i 2 Lj 
(assuming x £ 

The uniform kernel is an indicator function on [—5,5]: 

Kjj (x) = 1 if |x| < -, and otherwise, 

where a multi-dimensional version is a weighted indicator function for the hypercube centered at x l with 
each side equal to \. Most other kernels used with KDE are smooth approximations of Ky, e.g., Gaussian 
kernel Kjy ( x ) = ~^^ e ~ X ^ ''■ KDE matches the mass around each data point (weighted according to the 
kernel) to that of the empirical distribution. Since the empirical distribution approaches the true distribution 
as n increases, the accuracy of KDE approximation improves with the increase in the number of data points 
and the decrease of bandwidth parameter h. The resulting representation however requires keeping all of the 
observations as parameters and requires exponentially many data points in the dimension d to approximate 
the underlying density well. 



3.2 Formulation 



Our approach preserves the mass around each data point by introducing additional moment constraints. 
Let B C X be a region in the support of X, and let Zg (a;) = 1 if x € B, and Zg (x) = otherwise 
denote an indicator function for B. Given metric space (X,a), let Bi = {x € X : a (x i , xj < e} be an 
e-neighborhood of X{. Then the probability mass for density / in the e-neighborhood B% of x% is P {Bi) = 
Ef [Z5J. We propose adding constraints to (3) which would approximately match the probability masses 
for Bi (i = 1, . . . , n) between the empirical and the estimated distributions (/„ and /„ respectively): 



E 



(*)] " E 



/ n (sB|sB 1:n ) 



< Pi 



(4) 



where > determine how closely the masses should match. Similar to KDEs, Zg. in (4) may be replaced 
with a multidimensional kernel Kn (x % ; x) , which assigns decaying importance of mass away from the cen- 
ter (e.g., a smoothed version of Z^). We will use t l a = Ku (x 1 ; x) , and use t a (x) = \t\ (x) , . . . , (x)~\ 
to augment the statistics t in estimating densities. 2 In addition to the canonical parameters A for sufficient 
statistics, we add augmented parameters A a for the augmented statistics t a (x). 
Our proposed density approximation (/„ (au)) is a solution to 



x l - n ) = argminKL (f NE \\ q) subj to 

fNE eJ r 



E fNE( x \ x 1:n) [t (X)\ 



E 



[t (x)} 



(5) 



[t\(xj\ -E L [t a {x)] <0i,i = l, 



•J tNa I v n 1 ^ j I u r 

JnL" v/ J Jn 

fn E falls within the generalized MaxEnt framework (Dudik et al., 2007): 

1 



,n. 



/(*) 



Z(X,X„ 



-q (x) exp [(A, t (x)) + (A a , t a (x))] 



Z(X,X a )= / g(a;)e3q)[(A,t(aj)) + (Ao,*o(a5))]dx. 
Jx 



(6) 



2 We omit h from t\ for the simplicity of notation. It is a tuning parameter that may be set globally for all i = 1, 
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Let s (x) = (t (x) , t a (a?)) and = (A, A a ) be a combined set of statistics and parameters, respectively, for 
the augmented model. A specific set of parameter values for the distribution in (6) satisfying the constraints 
in (5) can be found by maximizing the penalized log-likelihood 

i(e) = /e,-Y, s ( xi ) ) ~ lnZ (*) - E # l A »l ■ ^ 

\ n i=i / i=i 

Note that the distribution 3 satisfying the constraint in (5) will always exist since f n satisfies all of the 
above constraints. 

We refer to the above class of models as non-parametric exponential family models, since the number 
of non-zero parameters = (A, A a ) may increase with the number of available data points. We will 
denote this family by J\fEF s . Clearly STt Q M8F S as all the augmented parameters can be set to 0. Let 
6 n = (^X n , be the MLE of (A, A a ) for the case of n samples. The ^i-penalty in (7) is known to be 

sparsity-inducing (e.g., Bach et al., 2011), so in practice, many of \ l an = 0. 

Note that the framework in (5) allows matching the moments of the estimated distribution to some 
predetermined vector t* £ rint (conv (H)) instead of the empirical moments Ej ^\~\-. n \ \f-^)\ leading 
to the same functional form for the density (6), but with an additional linear term (A, t* — - 2~2i=i * [ x j ) 
in (7) (Dudik et al., 2007). Thus, our non-parametric density estimator is capable of satisfying global 
constraints (matching a set of provided moments, e.g., learning a distribution with a given covariance). We 
are not aware of other non-parametric methods with this capability. 



3.3 Theoretical Properties 

The proofs appear in the Appendix A. 

Theorem 3.1. Suppose a vector of random variables X with support on X has a density f £ £J~t with 

features t : X — > 7~L C M d and a vector of canonical parameters A £ C £ M. d . Suppose 

is a sequence of i.i.d. random vectors drawn from f. Let f^ E ( x\6 n , x l:n I £ M£F S be the MLE solution 



of (5), 6 n = yXn, A a ,n )> with all Pi = /3 > 0, i = 1, . . . ,n. Assuming 

1. X is compact, 

2. t is continuous, 

3. ETt is a family of uniformly equicontinuous functions w.r.t x, 

4. Kernel K has bounded variation and has a bandwidth parameter H such that the series X^nLi e _7n ' H ' 
converges for every positive value ofj, 

then as n — > oo, X a n — >■ 0, Vi = 1, . . . , n and X n — > X. 

Intuitively, uniform convergence is required because we need the n additional constraints be satisfied 
with fixed threshold ft as long as n > N. We use Assumption 2 to relate the pointwise convergence of the 

MLE A in regular exponential families to pointwise convergence of f E ^;r|A n ^. Assumptions 1 and 3 are 

further employed to convert the pointwise convergence to uniform convergence of f E by considering X, C 
to be subsets of the original regular exponential family. 4 Assumption 4 is the requirement used by Nadaraya 
(1965) for the uniform convergence of KDE satisfied by common kernels. Upon these uniform convergence 

results, the augmented constraints will be satisfied by the original MLE estimate A n , . Further, the nature 



3 fn E { x \0, x ) 's functional form depends on both x l n and 9, for convenience, we sometimes omit and x 1:n in notation 
sn referring to f„ E . 

4 For example, Theorem 3.1 applies to the exponential family Af(n, a 2 ) with a £ [a, b],V6 > a > 0, but not if a 2 £ (0,oo). 
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of A n being a maximum entropy solution guarantees that A n , is a maximum entropy solution under the 
additional constraints. 

Theorem 3.1 shows that if the true distribution falls within the exponential family, then as sample size 
increases, the estimated density from the non-parametric exponential family will have vanishing reliance on 
the augmented parameters. 

Theorem 3.2. Given a probability density function f (x) : X — > R, let f^ E (x\9 n , a; 1:n ^ G NEF S be a 
solution satisfying (5). If 

1. f is uniformly continuous on X, 

2- Kh (x) is uniformly continuous on X, 

m 

4. lim K u (x) l\xi = 0, 

1 1 £C 1 1 — >00 j— I 

5. lim |H|3 = 0, 

n— »oo 

1 

6. lim n |H| 2 = oo, 

n— >oo 

pNE ( '\q „l:n\ P, 



then [x\0 n , x j — > f (x) pointwise on X. 

Assumptions 3-6 are required for pointwise convergence of KDE (Parzen, 1962) at specific points x 1:n . 
We then extend this pointwise convergence from x 1:n to X, considering the probability of sampling a new 
x € X far away from existing x 1:n under the true density /. The monotone convergence theorem and the 
uniform continuity assumptions (1 and 2) lead to the pointwise convergence f^ E A- f on X. 

Theorem 3.2 indicates the weak consistency of the non-parametric exponential family density estima- 
tor. Thus our proposed non-parametric approach can be used to approximate densities which are not from 
exponential families. 

3.4 Estimating Parameters for Non-Parametric Exponential Families 

Recently there have been a number of methods developed for optimization of convex non-smooth functions, 
some of them specifically aimed at log-linear problems such as (7) (e.g., Bach et al., 201 1, Shalev-Shwartz 
and Tewari, 201 1, Wu and Lange, 2008). We employed a coordinate descent algorithm similar to the SUM- 
MET algorithm of Dudik et al. (2007) (see Algorithm 1), primarily, due to its simplicity. Other possible 
approaches can be employed as well and may end up more efficient for this formulation. 

The proposed algorithm iterates between optimizing canonical parameters A (by setting [t (a;)] = 
EfNE( x \g{k)\ [t (x)]) and sequentially optimizing the augmented parameters A a so that the Karush-Kuhn- 
Tucker conditions (e.g., Nocedal and Wright, 2006) are satisfied: 



f {A} A* > 0, 
{-ft} A* < 0, 
[(-ft, ft) # = 0. 



(k) 

Algorithm 1 belies the inherent difficulty of: (1) calculation of the partial derivative g: , and (2) an 

implicit search procedure to update A« , both involve calculating intractable integrals. If the support is 
low-dimensional and the mass is contained in a small volume, then the partition function (and thus the 
gradient) can be computed by numerical integration (quadrature). Alternatively, a common approach to 
MLE with an intractable partition function Z (8) is Markov Chain Monte Carlo MLE (MCMC-MLE, Geyer 
and Thompson, 1992). For example, the time complexity at each iteration k is 0{Sn 2 ), where S is the 
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Algorithm 1 Non-Parametric Exponential Family Coordinate Descent 



INPUT: Samples x 1 , . . . , x n S sufficient statistics t : X — >• %, augmented features t\ : % 
i = 1, . . . , n, l\ regularization parameters f3 
OUTPUT: MLE = (A 1 , . . . , X d , A*, . . . , A") 
Initialize (o) 

Compute the sufficient statistics E^^ xl]n ^ [t (»)] 
repeat 

iteration fc = lb + 1, (fc) = fl^" 1 ) 
for i = 1 , . . . , d do 



,(*) 



91 ' = E f n [<* (*)] " ^(^lew) (*)] 

Perform line search along to update A*'W 
end for 

for j ' = 1 ... n do 

Solve two equations for Aa (Aa and Aa' + , respectively): 



E f NE 



E 



choose A^ fc ^ 
choose A^ fc ^ 
choose 
end for 

until convergence 

return (fc) 



ti (x) 


= E f 


ti (x) 


ti (x) 


= E f 

Jn 


ti (x) 



-Pi 



Aj'~if >0 

A J a ' + if A J a '+ < 

otherwise 



number of Monte-Carlo samples we choose to use. However, we believe developments in optimization 
(Bach et al., 201 1, Shalev-Shwartz and Tewari, 201 1, e.g.) will help us find an efficient solution. 

4 Application to Modeling of Graphs 

In this section, we turn our attention to a problem of learning a distribution over X = Q n , a set of undirected 
graphs with n vertices and no self-loops, from a single observed instance G* S Q n , an important branch 
in the analysis of social networks because of complicated relational structure (e.g. Goodreau, 2007). A 
commonly used approach to this problem which arises in the analysis of social networks is to estimate a 
distribution using exponential random graph models (ERGMs, e.g., Handcock, 2003, Robins et al., 2007a,b, 
Wasserman and Pattison, 1996, Wasserman and Robins, 2004). This approach however suffers from the 
model degeneracy, with estimated models placing probability mass on unrealistic graphs (e.g., complete 
or empty) and away from the observed instance. We propose a modification to ERGMs utilizing the non- 
parametric exponential family approach from Section 3 which alleviates the above issue of degeneracy. 

4.1 Exponential Random Graph Models 

An ERGM (or p* model) is an exponential family model over Q n which uses graph statistics as its features 5 . 
These features are typically motivated by the properties of the networks that are of interest to domain sci- 
entists (e.g., sociologists), and may include (among other local and global features) the number of edges 

s sufficient statistics for the exponential family 
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ERGM for (E=22,T=29) 




(a) The graph ^ ^0 

with 22 edges 

and 29 trian- 
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MPERGM for (E=22,T=29) 
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(c) Probability mass for MPERGM 



Figure 2: Degenerate ERGM and Non-degenerate MPERGM. The models are trained based on the observa- 
tion t e (G*) = 22, t&(G*) = 29. The orange x is the observed statistics, and the red + is the mode of the 
learned model. The color bar on the right from red to blue represents the probability mass changing from 
high to low. 



(t e (G) = J2 J2i<i<j< n e ij) and triangles (t A (G) = EE Ei<i<j<fc<n e ij e ikejk), where = 1 if there 
is edge between nodes i and j, and otherwise. The probability mass for a graph G G Q n is defined as 

P(G\X) = ¥ i-exp(A,t(G')>, 

Z(X)=J2 exp(A,t(G)>. (M 

Geg n 

The MLE A makes mean statistics of the distribution match that of the observed graph: E P ( G \ X ) [t (G)] = 
t(G*). 

4.2 Degeneracy 

Let 7i = {t (G) : G G Q n } be the set of all possible values for features. Even though in theory if the feature 
vector for the observed graph is in the relative interior of the convex hull t (G*) G rint (conv {%)), MLE 
A exists (and is unique if the set of features is linearly independent or minimal), in practice ERGMs often 
suffer from degeneracy (Handcock, 2003) manifested in one of the following ways: (1) MLE procedure 
does not converge due to numerical instabilities, and (2) MLE is found, but the resulting probability mass is 
placed mostly on unrealistic graphs (i.e., empty or complete graphs) and little mass is placed in the vicinity 
of the observed graph (around t (G*) in %, c.f. Figure 2). 

We focus on addressing the second type of degeneracy; for more information of the reasons of the first 
type of degeneracy see Handcock (2003), Rinaldo et al. (2009). Several attempts have been made to address 
the second type of degeneracy issue: Handcock et al. (2008) proposed to use domain knowledge specific 
feature sets in addition to edge and triangle features; Hunter and Handcock (2006), Hunter et al. (2008) 
used curved exponential families for ERGMs; Caimo and Friel (2010) suggested Bayesian ERGMs, and Jin 
and Liang (2012) devised an estimation procedure on stochastic approximation with varying truncation in 
parameter space. Lunga and Kirshner (201 1) suggested the degeneracy issue for interior points may be due 
to the bounded support %, and proposed spherical features for modifying the geometry of %. In summary, 
there are two main approaches towards fixing degeneracy: 1) modifying the geometry (Handcock et al., 
2008, Hunter et al., 2008, Lunga and Kirshner, 201 1), and 2) limiting exploration in the canonical parameter 
space (Caimo and Friel, 2010, Jin and Liang, 2012). Our approach belongs in the first category. 
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4.3 Mass-Preserving ERGMs 



To modify ERGMs, we solve the optimization problem in (5) with the uniform base measure q (G) over 
possible graphs G £ Q n . Let t a (G) = Kn (t (G*) ; t (G)), a smoothed mass indicator in the neighborhood 
of the feature values for the observed graph. The solution is an exponential family probability mass function 

/ (G) = zTTxj ex P KA, t (GO) + <A 0; t a (G))] 

Z (A, A ) = ]T exp [(A, t (G)) + (X a ,t a (G))] . 
Geg n 

which we refer to as mass-preserving ERGM (MPERGM). The corresponding objective function 

I (A, A a ) = (A, t (G*)) + (A a , t a (G*)) - In Z (A, A a ) - |A a | 

is concave. 

There are several challenges with parameter estimation, most encountered before in ERGM fitting (e.g., 
Hunter et al., 2008). As in the continuous case, the gradient cannot be computed in closed form except 
for graphs of small size {Q n for n < 11). We therefore apply MCMC-MLE approach of Hunter and 
Handcock (2006), computing Ef[t(G)] in Algorithm 1 as a sampled average ^ Yld=i {pip 1 ')') where 
qI-.s l -^ d j (G\\,\ a ). There are, however, two complications with this approach. One, graph sampling 
from ERGMs is performed using Gibbs sampling and is computationally expensive. Therefore, graphs G 1:S 
are re-sampled only once in several iterations, and reused for other iterations with weights equal to the pos- 
terior probabilities. Two, the resulting distribution over graphs can be multi-modal, and according to Hunter 
and Handcock (2006), Jin and Liang (2012), the sampler can get stuck around the closest mode leading to 
an incorrect estimate of the gradient. Instead of performing line search, we use the direction of the gradient 
with a predefined step-size. 

5 Experimental Evaluation 

5.1 Non-Parametric Exponential Family Density Estimation 

We illustrate the behavior of the proposed non-parametric density estimator matching first and second order 
moment constraints (NPGaussian, i.e. t(x) = (x,x 2 )) in the univariate setting. Normal density (in 6F, 
J\f(0, 1)), mixture of two normals (not in £T, 3, 1) + \N{?>, 1)), and a t-distribution (not in SJ 7 , 

df=6) are used for simulating i.i.d samples. We vary the sample size from 10 to 1000 for training and 
compute the out-of-sample likelihood with an evaluation set of 100000 samples for testing. We compared 
the performance of our non-parametric approach, the model from the true functional family, and another non- 
parametric approach (KDE). There are two sets of tuning parameters, bandwidth h and the box constraint 
parameter j3, assumed to be the same for all i = 1, . . . , n. j3 was set according to a fixed schedule (3(n) = 
0(1/ y/n). h (both for KDE and for our approach) was determined based on cross- validated log-likelihood. 6 
Gaussian kernel function is used for NPGaussian and for KDE. For estimating mixture distribution, the 
estimated NPGaussian model provides an approximation better than KDE, and perhaps not surprisingly, 
better than GMM when the training sample size is small (Figure 3(a)). For estimating normal density, the 
NPGaussian model quickly converges to the normal density as suggested by Theorem 3.1 (Figure 3(b)). We 
also consider the case when the true sufficient statistics are given to us (constrained NPGaussian, CNPG). 

6 Similar to KDE, the choice of kernel width h is important for obtaining good estimates. To test how the non-parametric 
exponential family is affected by the choice of h, we employed the same Gaussian kernel function to do density estimation with 
both KDE and non-parametric Gaussian. It appears that the best bandwidth are different. 



9 



The CNPG model shows improvement over NPGaussian for small n. However, as the training sample 
size increases, both CNPG and NPGaussian show similar performance as the moment constraints t(x) are 
more accurately approximated. We also experimented with 0(1/ log (ra)) regularization schedule for /3s to 
estimate the mixed normal distribution. As n increase, the solution for NPGaussian is too sparse and gives 
a worse performance than KDE (Figure 3(a)). However, it also enjoys a sparse set of augmented parameters 
A a (Figure 3(d)), whereas with schedule 0(l/y/n), NPGaussian keeps adding non-zero A a s. 
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Figure 3: Estimating simple one dimensional densities. Results are averaged over 20 runs. The x axis 
is in log scale, (a) Mixed normal distribution (b) Normal distribution (c) t distribution (d) Number of 
non-zero A a s. In Figure(a,b,c), NPG: NPGaussian with 0(1 /y/n) schedule, NPG-IgN: NPGaussian with 
0(l/log(n)) schedule, CNPG: constrained NPGaussian with true global moment statistics. In (d), NPG-t: 
number of non-zeros for estimating t distribution with NPG, NPG-n: number of non-zeros for estimating 
normal distribution with NPG, NPG: number of non-zeros for estimating mixed normal distribution with 
NPG, CNPG: number of non-zeros for estimating mixed normal distribution with CNPG, NPG-IgN: number 
of non-zeros for estimating mixed normal distribution with NPG-IgN. 



5.2 Modeling Graphs with MPERGMs 

We evaluate the fit of the estimated models by comparing local statistics of the observed graph to that of the 
samples generated from the estimated distribution. 7 

We make use of three sets of local statistics commonly used as goodness-of-fit measures for ERGMs 
Hunter et al. (2008): the degree distribution (the proportion of nodes with exactly k neighbors), edgewise 
shared partner distribution (the proportion of edges joining nodes with exactly k neighbors in common), 
and the minimum geodesic distance (the proportion of connected node-pairs which has a minimum distance 
of k). 

We consider the number of edges and triangles as sufficient statistics, t (G) = (t e (G) , t& (G)). First, 
we consider the toy domain of graphs with 8 nodes, Q%. We enumerate all possible K = 12346 non- 
isomorphic graphs and resulting feature tuples, and compute probability mass entries m, . . . , ttk- We 
trained our MPERGM with a Gaussian kernel function with h = 8, ft = 0.2. Figure 2 shows that MPERGM 
puts larger probability mass around G*. 

We also estimated MPERGMs for several social network data sets, ranging in the number of nodes 
from 16 to 1024, and with varying density of edges. Since the number of nodes n for these graphs are 
too large to enumerate Q n , the graphs are drawn using Gibbs sampler, and the parameters for MPERGMs 
(and ERGMs, using the R package ergm (Hunter et al., 2008)) are estimated using MCMC-MLE. Then 
100 samples were generated using the Markov Chain with learned parameters. For MPERGM, the Markov 

7 See Hunter et al. (2008) for a discussion on the evaluation of fit for social networks. 
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Table 1: Social network data sets. g8: The 8-node graph as in Figure 2(a); Do: The dolphins data set 
(Lusseau et al., 2003); Kp: The Kapferer data set (Hunter et al., 2008); Fl: The Florentine Business data set 
(Hunter et al., 2008); Fa: The Faux.Mesa.High data set (Hunter et al., 2008); Ja: The Jazz data set (Gleiser 
and Danon, 2003); Ad: The AddHealth data set (Harris, 2008); Fb: The Facebook data set (Moreno and 
Neville, 2009); Em: The Email data set (Guimera et al., 2003). 
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Figure 4: Goodness of fit for small graphs. Gaussian kernel functions are used for MPERGM. ERGM is 
shown in blue dashed lines, and MPERGM is shown in red dashed lines. Black lines are the statistics for 
G*, being closer to black line means better fit. 

chain was initialized with the example graph, whereas we are not sure what initial state was used by ergm. 
We then run the chain for a burn-in of 1000 iterations and then use 100 iterations between each draw. The 
100 samples are then used to plot the graph statistics for goodness-of-fit test in Figure 4 and Figure 5. For 
each estimated model, the statistics in Figure 4 & 5 were generated from 100 sampled graphs obtained 
by running Gibbs with 1000 iterations for burn-in and 100 iterations between samples. We initialized our 
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Figure 5: Goodness of fit for large graphs. Gaussian kernel functions are used for MPERGM. ERGM is 
shown in blue dashed lines, and MPERGM is shown in red dashed lines. 

Markov chain with the example graph, whereas we are not sure what initial state did ergm use. We used a set 
of hand-tuned step-size and h for different data sets, and re-scaled the edge and triangle features by a factor 
of frg*y and j^jq*\ ■ Empirically, we find h ~ 8 and a predefined step-size 10 works well for small graphs. 
For graphs with several thousands of nodes, the pre-defined step-size and h needs to be larger to guarantee 
reasonable variance and concentration of the model. In Figure 4, ERGM is degenerate for the Florentine 
and Dolphins dataset, because most sampled graphs have 0-degree nodes (third row), while MPERGM 
is able to generate samples scoring a similar set of graph statistics. In order to investigate the variance 
of the learned MPERGM, we count the number of samples that are different in structure (not counting 
isomorphism) or different in features (number of edges and triangles), while recording the maximum number 
of unique edge-flips needed to get from the initial state to the sampled graph (number of max hops). The 
results in Table 1 suggests that our sampler explores Q n with a considerable range. 

6 Discussion and Conclusions 

We have proposed a non-parametric exponential family model that is capable of approximating distributions 
that do not fall within the exponential family empirically. As the data size increases, this model can approxi- 
mate arbitrary (continuous) densities with tuning parameters controlling the sparsity. And if the true density 
falls within the exponential family characterized by the chosen features, the estimated non-parametric model 
converges to the parametric one. The proposed framework results in a non-parametric density estimator 
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which admits global constraints; if available, this information may require fewer data points to approximate 
the underlying density. The resulting MLE optimization problem is concave with l\ penalty term, but raises 
a computational challenge because the number of inequality constraints is proportional to the number of 
data points. We also adapted the approach to modify exponential random graph models for graphs to come 
up with an exponential family model averse to model degeneracy. 

As future directions, we would like to investigate the rules for selecting bandwidth parameters, the 
acceleration of the optimization problem, and efficient sampling techniques for sampling from the non- 
parametric exponential family. 
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A Proofs 



A.l Theorem 3.1 

To prove Theorem 3.1, let's take a closer look at the augmented constraints. First consider the augmented 
statistics ta* for sample x l . 

Lemma A.l. E fn(x]xl:n) [4 (x)] = DE (x*^) 
Proof. 

1 n 

Ef n{xlx i.. n) [4 (x)] = - £ K n (**; = /™ £ (x^) , 

i=i 



□ 



i.e., the augmented statistics are the KDE estimates for samples x l . . . x" respectively. Parzen (1962) 
proved the pointwise convergence of KDEs and Nadaraya (1965) proved the uniform convergence for KDEs 
under further assumptions. We include the first half of (Nadaraya, 1965, Theorem 1) below since it is 
essential for proving both Theorem 3. 1 and Theorem 3.2. 

Theorem A.2. Let f^ DE (x) = Ef/ y i-.n-\ [fn DE { x \y 1:n )] be the expected value of the KDE density given 
sample points y 1 ' n /. 

Suppose Kh (x) : x 6 X — > R is a function of bounded variation and f (x) is a uniformly continuous 
density function, and the series Y^=i e _7?1 ' H ' converges for every positive value 0/7. 
Then f^ DE (;r|a; 1:n ) a -4' J* DE (x) uniformly on X. 
That is, sup^g^ 



tf DE » -l K n DE \x) 



a 4'o. 



Remark 1. It is helpful to note that given x 1 . . . x n /, 



/n 1 n 

]__[ fix-') A'h (.7:': x') d.r' ... x" 

lit J K H (x i ;x>)f[x>)dxi= J K H (x i ;x)f(x)dx = E f [tl(x)} 



n 

Theorem 3.1 For a density / (a;) E £Ft, with domain space X, feature space 7~L, and canonical parameters 
A G C 6 M d . Suppose x 1 , . . . ,x n = x 1:n is a sequence of independent and identical random vectors 
drawn from /. Let f£ E (x\0 n , x v - n \ £ N£T S be the MLE solution of (5) Q n = (X n , X a ,n), with all 
Pi = P > 0, i = 1, . . . , n. Assuming 

1 . X is compact, 

2. t is continuous, 

3. £Ft is a family of uniformly equicontinuous functions w.r.t x, 

4. Kernel K has bounded variation and has a bandwidth parameter H such that the series Yl^Li e _7n ' H ' 
converges for every positive value of 7, 

then as n — > 00, A* „ — > 0, Vi = 1, . . . , n and X n — > A. 
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Proof. Consider the estimated densities f& I x\X n ) G £P"t and ( a?|0 n , x 1:n ) G M£F S - 



MLE for regular exponential families converges in probability to the true values of parameters, A n — > A. 
(e.g. Wainwright and Jordan, 2008, Chapter 6), (e.g. van der Vaart, 1998, Chapter 5.2) 

Since f (x\X) is continuous w.r.t A, given any x G X, (x\ A n ^ A f (x\X). Because ST± is an 

equicontinuous family, and X is compact, according to (Royden, 1988, Problem 9.6), /jf A / uniformly 
on X. 

Thus given any 8, (3 > 0, there exists N s.t. when n > N, \/x G X, P( ke| A ra ^ — / (a?| A 
5. Then, we have 



>§)< 



P 



%Ha„) [*« (*)] - ^/ [4 (*) 



> 







= P 
< P 

= W 



fE(x\X n )-f(x\X) ti(x)dx> 



sup 

x&X x&X 



fn [ X W 



f(x\X)\t\(x)dx>^ 



sup 

\xex 



f*[x\X n )-f(x\X 



> 



< 5. 



Thus for all x l G X, given any 5\, § > 0, there exist iVi, such that when n > Ni, 

%(x|a w ) K(*)]-^/K(*)]|>§) 

Meanwhile, since we have Ef / x < x v.nj \t l a i x )] f° r tne box constraints in (5), we would like to have 

B /,H^) ft (*)] ft (*)] 

for any x l £ X uniformly as well. This is satisfied under assumptions of Theorem A.2. Then we would 
have 

E Ux\x^) ft (*)] ^ E f ft (x)] 
for alii = 1 . . . n. Because almost sure convergence implies convergence in probability, we have: for all 
x l G X, given any 62, § > 0, there exists N%, s.t. when n > N%, 



P 



E Ux\x^ ft (*)] - E f ft (x)] 



> 



p 



< So 



Thus using Triangle Inequality and Boole Inequality, when n > N = max(A r i, N2), 



P 



E Ux\x^) KW]-%(^) ftO") 



This means that the additional constraints will be matched by [A, 0] in probability. Because X n is the 
solution to (3) without the additional constraints, thus it will have smaller KL divergence (larger entropy) 
than 6 n . Therefore [A n , 0] is bound to be the MLE solution to (5). Since A n A A, we have n A 0, Mi = 
1, . . . , n and X n — > X. □ 

Remark 2. It can be noted that a sufficient condition for £Ft to be an equicontinuous exponential family is 
that X and C are both compact. Since in reality, we rarely deal with probability density functions with infinite 
density values, the conditions for Theorem 3.1, though look restrictive, do not constrain the application 
domain much. However, since for any regular exponential family the canonical parameter space is open 
(Brown, 1986, Theorem 3.6), this means Theorem 3.1 for non-parametric exponential family works only on 
a closed subset C, of the original canonical parameter space. 
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A.2 Theorem 3.2 

To prove the pointwise convergence result for our non-parametric exponential family, we rely on the con- 
vergence of KDEs and Triangle Inequalities. We start off by introducing several lemmas. 

Lemma A.3. Let fn DE (x) = Ej NB t Un ^ xl:n \ [fn DE {xly 1 '™)} be the expected value of the KDE 
density given sample points y 1:n % ~ d f£ E \ x\9 n , x 1:n ^. Then f£ DE (x l ) = E f ^ E ,^ xX , n ^ [t l a (x)]. 

Proof. Similar as Remark 1 , 



fJ DE (x i ) = J \\f^{x^) l -Y,K^{x^)dx\..x- 

X 1 .. .x n 

111 J KH{x i ;x?)tf E {x>)dxi= J K H {x i ;x)f^ E (x\)dx 



n . 

J xif^X xex 
= E f£> E {x) ]fa (x)\ . 

□ 

A pointwise convergence result for the expected KDE density can be found in Parzen (1962): 

Theorem A.4. Suppose f (x) is any probability density function which is continuous at point x°. Let 
f n (x) = Eft y i:n\ [fn DE ( x \y l:n )] be the expected value of the KDE density given sample points 
yl:n j_ 

Suppose Kh (x) : x £ X C M m ->Risd probability density function which satisfies: 



sup Kh (x) < oo, (9) 

x&X 

m 

lim K u (x)T\x i = 0, (10) 

1=1 

lim |H|2 = 0. (11) 

11— >oo 

Then lim /* DE (x°) = f (x°). 

That is, the expected KDE density at continuity point x° converges to the sampling probability density 
function f at x°. 

Corollary A.5. Given a sample x\ lim E f NE( x ) \t l (x)] = f^ E (x l ) if Kh (x) is continuous and 
(9),(10),(1 1) are satisfied. 

Proof. If Kh (x) is continuous, then f^ E satisfies Theorem A.4's conditions for /. Combining Lemma 
A.3, we have then have Corollary A.5. □ 

In addition, we have the mean-square convergence of the KDE density stated in Parzen (1962) as well: 
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Theorem A.6. Suppose (x) : x G X C 
(9),(10),(1 1), also satisfies: 



lim n |H| 2 = oo 



a probability density function which in addition to 

(12) 



and f (x) is probability density function which is continuous at point x°. 
Then lim E f(x i:n) \(f^ DE (a; |a; 1:n ) - / (a; )) 2 ] = 0. 

That is, the KDE density at continuity point x° converges in mean square to the true density f at x°. 
Lemma A.7. 

f» E (*') 4 / (*«) 

if 

1. Kh (x) is continuous, 

2. and (9), ( 1 0), ( 1 1 ), ( 1 2) are satisfied, 

3. and lim = 0, Vi = 1 . . . n. 



Proof. By the Triangle Inequality, 



l/rV)-/(* l )l< fr(*)-f, 



cKDE 



X 



+ 



r 



r 



Fix £, £ > 0. Find iVj using Corollary A.5,iV2 using Theorem A.6, N3 by the schedule of j3 s.t. 



P 



KDE 



tf B M - /if D£ M > e/3j < c/3, 

P(|/™ (O I > £/ 3 ) <CA 

) - (»V :B ) > < C/3, 



for all n > N\, n > N2, and n > N3, respectively. Set N = max {Ni, N2, N3}. Then by Boole Inequality, 
when n > N, 



p(K E M - / (»*) I > e) < p ( f» E M - /r £ M > e/3; 

+ p(|/f D£ (*V :n )-/K)l>£/3) 
+ p 



KDE 



r 



So V£ > 0, lim^ P (|/* E [x*) - f [x 1 ) I > £) = CI. 



x 



) - f* DE ( x *\x^) >e/3 <c. 



□ 



Lemma A. 8. Given any uniformly continuous probability density function f : X 



I, and n samples 



x 



n i.i.d 
, X n ~ 



/. V£ > 0, if we draw a new sample x ~ /, 



lim P 

n— >oo 



f(x)-f f x argmint = 1 - n \\ x - x, \ 



Proof. Since / is uniformly continuous, given any £ > 0, there exists e > 0, s.t. Voj, y G and \\x — y\\ < 



e, we have \f (x) - f (y)\ < £ Therefore, let A be the event f (x) - f [x aTgmhl ^ -« W x 



> £ let 



B be the event 
P 



x — £C ar S lnill «=l-n \\ x ~ xi \ 



> e then Ac B. Thus P(A) < P(B), i.e. 



f(x)-f fa5 argmini=1 -- n W x ~ xi ^ 



>d <P 



^argmin^ n \\x-x l \ 



> e 
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If we divide X into countable number of hypercubes with each side length being e, and index the 
hypercubes as Q\ . . . Qk ■ ■ . . Let pu = J x ^Q k f ( x ) d x - Then the probability for event B to happen is that 
x is the first sample to drop in V/c G Z. That is, P n (B) = PkO- ~ Pk) n - Because pu (1 — Pk) n — > 0, 

and is monotonically decreasing, using monotone convergence theorem, we have lim P n (B) = 0. 

n— >oo 

Therefore, 



lim P 

n— >oo 



f(x)-f ( x * t * iaia *='~«»' B - x ' 



> £ ) < lim P 



x — a .argmin i=1 .. n | 



> £ 



0. 



□ 



Now we are ready to prove Theorem 3.2 with Lemma A.7 and Lemma A.8. 
Theorem 3.2 Given a probability density function / (sc) : X ->• M, let f£ E (x\9 n , x v - n ) G M£F S be a 
solution satisfying (5). If 

1. / is uniformly continuous on X, 

2. Kn (x) is uniformly continuous on X, 

3. and (9),(10),(11),(12) holds, 

4. lim j3i = 0,\/i = 1 . . . n, 
n-+oa 

(NE ( '\a „l:n\ K 



then/^ [x\6 n ,x 



f (x) pointwise on X. 



Proof. By the Triangle Inequality, given any x G X, and f^ E (x) trained on n samples x 1:n *~ /, let 
x — SB* II. 



i' = argmm i=Ln 



cNE 



X 



+ 



|/f w-/w|< /r (<)-/(<) + f? E i*)-ti 

Fix £, C > 0, use Lemma A.7 to find Ni G N and Lemma A.8 to find N 2 , N 3 G N s.t. 

p(\tf E (^')-/(^')|>e/3)<c/3, 
pi rw-/f M >e/3) <ca 



/(*)-/ he* 



/(*)-/(*<) >e/3j <CA 

for all n > Ni, n > N2, and n > N3, respectively. Set N = max {Ni, N2, N3}. Then by Boole Inequality, 
when n > N, 



P(\f? s (x)-f(x)\>t)<P 



f» E (x*)-f[x* 



>£/3 



+ p (jf» E (x) - f»* (V) I > e/ 3 ) + p (|/ (*) - / (V) I > e/a 

SoV£>0, \im n ^ P[\f* E (x)-f{x)\ >e) =0. 



□ 
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